Skip to content

Conversation

CloseChoice
Copy link
Contributor

@CloseChoice CloseChoice commented Oct 17, 2025

closes #7567

Add shift_rngs method to ExamplesIterable that is called directly after sharding. If a generator is available (not the case for all subclasses) we update the seed of the generator by shifting by the worker_id.

This is just the fix for shuffle, in the corresponding issue interleave_datasets is mentioned as well, which won't be fixed with this approach.

EDIT: This is a fix for shuffle and interleave_datasets. Adding recursivity to shift_rngs solved interleave_datasets as well. Not sure though if this is completely safe or if we could destroy something with that. I don't think so but could be wrong and appreciate some guidance from the maintainers. I also checked, on a single_worker we are always handing over index=0 so that case preserves the seed the user specified.

@CloseChoice CloseChoice changed the title Fix shuffle seed 7567 Fix random seed when shuffling Oct 17, 2025
@CloseChoice CloseChoice changed the title Fix random seed when shuffling Fix random seed when shuffling and interleaving_datasets Oct 17, 2025
@CloseChoice CloseChoice changed the title Fix random seed when shuffling and interleaving_datasets Fix random seed on shuffle and interleave_datasets Oct 17, 2025
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq
Copy link
Member

lhoestq commented Oct 20, 2025

Cool ! To avoid unwanted side effects it could be implemented for every class instead of using hasattr, and also return a new instance instead of modifying it in place. Some other functions work like that already like shuffle_data_sources and shard_data_sources.

Returning a new object is actually quite important otherwise iterating on the dataset multiple times would shift the RNGs every time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

interleave_datasets seed with multiple workers

3 participants